This study addresses the critical public health challenge of Coronary Artery Disease (CAD), a leading cause of global morbidity and mortality, by aiming to improve upon current risk prediction methods. Existing tools often show limited accuracy, particularly in diverse populations or specific subgroups, and struggle to effectively integrate the growing wealth of genetic information. The primary objective was to develop and validate a novel machine learning framework – termed 'meta-prediction' – that integrates a wide range of unmodifiable risk factors (like age and numerous genetic predispositions summarized by Polygenic Risk Scores, or PRSs) with modifiable factors (clinical measurements, lifestyle information) to generate more accurate, personalized, and actionable 10-year CAD risk estimates.
Methodologically, the researchers utilized large-scale data from the UK Biobank (UKBB), strategically dividing participants into two groups: one with existing CAD ('prevalent cohort') and one initially free of CAD who were followed over time ('incident cohort'). The core innovation involved a two-stage process: first, training numerous baseline predictive models on the prevalent cohort to estimate various risk factors and diagnoses; second, using the outputs (predictions) from these baseline models as new input features – 'meta-features' – along with directly measured data, to train a final ensemble machine learning model (specifically XGBoost) predicting the 10-year risk of developing CAD in the incident cohort. This hierarchical approach allowed the model to learn complex patterns and interactions from over 1,700 initial features, ultimately selecting the 50 most informative ones, including 15 meta-features and 22 PRSs.
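The two-stage idea can be illustrated with a deliberately small sketch. This is not the study's pipeline: ordinary least squares stands in for the baseline models, the final XGBoost stage is only indicated in a comment, and the cohort values, the `fit_line` helper, and the `ldl_meta` feature name are all hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for a baseline model)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

# Stage 1: learn a risk factor (here LDL) from a PRS in the "prevalent" cohort.
# All numbers are made up for illustration.
prevalent_prs = [-1.0, 0.0, 1.0, 2.0]
prevalent_ldl = [2.5, 3.0, 3.5, 4.0]
intercept, slope = fit_line(prevalent_prs, prevalent_ldl)

# Stage 2: apply the baseline model to the "incident" cohort. Its prediction
# becomes a meta-feature that sits next to the directly measured features.
incident = [
    {"age": 52, "prs": 0.5},
    {"age": 61, "prs": -0.5},
]
for person in incident:
    person["ldl_meta"] = intercept + slope * person["prs"]

# A final classifier (XGBoost in the study) would now be trained on `incident`.
print(incident)
```

The point of the design is that the baseline model never sees the incident cohort during fitting, so the meta-feature carries information learned elsewhere rather than leaking the outcome.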
The resulting meta-prediction model demonstrated significantly improved performance. Within the UKBB test set, it achieved high discrimination (Area Under the Curve, AUC, 0.84; AUC measures how well a model ranks individuals who develop disease above those who do not, where 1 is perfect and 0.5 is random chance). Crucially, this high performance was largely maintained (AUC 0.81) upon external validation in the independent and diverse All of Us (AoU) research program cohort, indicating good generalizability. The model substantially outperformed standard clinical risk scores like the Pooled Cohort Equations (PCE) and QRISK3 (average AUC improvement >10%, average Area Under the Precision-Recall Curve improvement 67%), and also showed better risk reclassification (Net Reclassification Index 0.14-0.21). Performance gains were particularly notable in subgroups traditionally considered low-risk. Furthermore, model interpretation using SHAP values identified key risk drivers, and simulations suggested the framework could potentially guide personalized interventions by predicting differential risk reduction based on an individual's genetic profile and risk subgroup.
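For readers less familiar with AUC, the following minimal sketch computes it directly from its ranking interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen non-case. The `auc` function and the toy labels and scores are illustrative only and are unrelated to the study's data.

```python
def auc(labels, scores):
    """Probability that a random case outscores a random non-case (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: two cases, three non-cases.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
print(auc(labels, scores))  # 5 of the 6 case/non-case pairs are ranked correctly
```

The pairwise loop is quadratic in cohort size, so it is suitable for explanation rather than for cohorts of the UKBB's scale.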
In conclusion, the study presents a powerful, integrative meta-prediction framework that significantly advances CAD risk prediction accuracy and generalizability compared to current standards. By effectively leveraging comprehensive genetic and non-genetic data through a sophisticated machine learning approach, the framework offers potential for more precise risk stratification and personalized prevention strategies. The findings underscore the importance of incorporating broad genetic information and complex interactions to capture individual CAD susceptibility more effectively, paving the way towards precision cardiology.
This study successfully developed and validated a sophisticated machine learning framework for predicting 10-year Coronary Artery Disease (CAD) risk, demonstrating superior performance compared to established clinical scores and previous research models. The 'meta-prediction' approach, integrating a vast array of genetic (numerous Polygenic Risk Scores - PRSs) and non-genetic factors (clinical, lifestyle, biomarker data) through intermediate predictive steps ('meta-features'), represents a significant methodological advance.
The model's high accuracy (AUC 0.84 in UK Biobank, 0.81 in the diverse All of Us cohort) and improved ability to reclassify individuals into more appropriate risk categories (Net Reclassification Index 0.14-0.21 vs standard scores) underscore its potential clinical utility. Particularly noteworthy is its enhanced performance in subgroups often considered lower risk by traditional methods (e.g., younger individuals, females), suggesting it captures risk pathways missed by simpler models. The framework's ability to simulate differential intervention benefits based on genetic risk profiles offers a promising avenue towards personalized prevention strategies, although these simulations are currently model-based hypotheses requiring prospective clinical validation.
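As context for the reclassification figures, a categorical Net Reclassification Index can be sketched as follows. This assumes ordered integer risk categories and at least one event and one non-event; the function name and the toy data are hypothetical, and the study may have used a different NRI variant (for example, a continuous one).

```python
def net_reclassification_index(old_cat, new_cat, events):
    """Categorical NRI; categories are ordered integers (higher = riskier)."""
    def net_up(rows):
        up = sum(1 for old, new in rows if new > old)
        down = sum(1 for old, new in rows if new < old)
        return (up - down) / len(rows)

    with_event = [(o, n) for o, n, e in zip(old_cat, new_cat, events) if e == 1]
    without_event = [(o, n) for o, n, e in zip(old_cat, new_cat, events) if e == 0]
    # Events should move up, non-events should move down.
    return net_up(with_event) - net_up(without_event)

# Toy data: risk category under the old score, the new model, and the outcome.
old_cat = [0, 0, 1, 1, 0, 1, 1, 0]
new_cat = [1, 0, 1, 1, 0, 0, 0, 1]
events = [1, 1, 1, 1, 0, 0, 0, 0]
print(net_reclassification_index(old_cat, new_cat, events))
```

A positive value means the new model moves people who go on to have events into higher categories, and people who do not into lower ones, more often than the reverse.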
While the findings are compelling, it is crucial to recognize the study's observational nature; the model identifies complex associations, but direct causation for all contributing factors is not established by this work alone. Although validated in the diverse All of Us cohort, further assessment in specific underrepresented populations and age groups is warranted. The practical implementation of such a complex model faces hurdles beyond predictive accuracy, including seamless integration into clinical workflows, clinician training and acceptance, regulatory approval, and demonstration of cost-effectiveness. The infrastructure for such deployment is emerging but not yet widespread.
In essence, this research provides a powerful demonstration of how advanced machine learning, applied to large-scale, multi-modal data including comprehensive genetics, can significantly refine cardiovascular risk prediction. It marks a substantial step towards more personalized and precise prevention of CAD. However, translating this potential into routine clinical impact requires overcoming implementation challenges and further validating the personalized intervention aspect through dedicated clinical trials.
The abstract clearly articulates the significant public health problem of CAD and the critical need for improved, personalized risk prediction, effectively setting the stage for the study's objectives.
The abstract concisely summarizes the complex meta-prediction methodology, including the use of distinct cohorts for training baseline and incident models and the concept of using baseline model outputs as meta-features.
Key quantitative results, including the model's high performance (AUC 0.84 in UK Biobank, 0.81 in All of Us) and its superiority over existing methods, are effectively highlighted, demonstrating the model's potential impact.
The abstract successfully conveys the translational potential of the framework by mentioning its ability to generate individualized risk reduction profiles and the finding that genetic risk influences intervention benefits, suggesting clinical utility.
This low-impact improvement would enhance immediate clarity for readers scanning the abstract. The Abstract section serves as a standalone summary, and explicitly stating the study design provides fundamental context upfront. Adding this information would slightly strengthen the abstract by immediately orienting the reader to the nature of the evidence presented, reinforcing the observational basis of the findings without requiring them to infer it or read further.
Implementation: In the sentence describing the data source, incorporate the study design type. For example, modify the sentence starting 'To power our meta-prediction approach...' to something like: 'Using data from the large-scale observational UK Biobank cohort, we stratified participants into two primary groups to power our meta-prediction approach...'
The Introduction effectively establishes the context by highlighting the limitations of current CAD risk prediction, particularly the debated utility of PRSs and the shortcomings of simple linear models in capturing the complexity of CAD.
The section provides a concise yet critical review of previous attempts to integrate genetic and clinical data, outlining why these approaches (linear combinations, brute-force feature inclusion) yielded only marginal improvements and failed to capture necessary interactions for personalization.
The Introduction clearly articulates the proposed 'omnigenic, integrative, meta-prediction framework,' explicitly stating its key differentiating features, such as incorporating numerous PRSs, using ML to detect interactions, and integrating predictions for multiple factors.
The text successfully outlines the specific goals of the framework, such as disentangling inherited from acquired risk and facilitating individualized risk assessment, thereby setting clear expectations for the study's aims and potential contributions.
This low-impact improvement would enhance the clarity of the study's central argument. The Introduction section effectively outlines the framework and its goals, but explicitly stating the core hypothesis would provide a sharper focus. Adding a concise hypothesis statement would strengthen the Introduction by clearly articulating the expected outcome of applying the novel framework, directly linking the proposed methodology to its anticipated advantages over existing approaches.
Implementation: At the end of the paragraph introducing the omnigenic framework, add a sentence stating the main hypothesis. For example: 'We hypothesize that this omnigenic, integrative, meta-prediction framework will achieve substantially higher accuracy and better risk stratification for incident CAD compared to existing clinical scores and simpler PRS integration methods, particularly by capturing complex interactions.'
The Results section clearly defines the study cohorts (prevalent and incident) derived from the UK Biobank, providing essential demographic and clinical characteristics, which establishes a solid foundation for understanding the subsequent analyses.
The process of generating meta-features and selecting the final 50 features for the model is described systematically, detailing the types of features included (measured, PRSs, meta-features) and their origins, enhancing methodological transparency.
The performance of the meta-prediction model is thoroughly reported using multiple standard metrics (AUROC, AUPRC, sensitivity, specificity, F1 score, Brier score) and visualized effectively (Fig 1c-f), allowing for a comprehensive assessment of its predictive power.
The study rigorously compares the meta-prediction model against numerous established clinical risk scores (PCE, QRISK3, PREVENT) and PRS-based models using multiple metrics (AUROC, AUPRC, C-index, NRI, IDI), clearly demonstrating its superior performance.
The analysis extends beyond overall performance to demonstrate the model's robustness across various clinically relevant subpopulations (stratified by age, sex, baseline risk scores, biomarkers, PRSs), highlighting its particular strength in identifying risk within traditionally low-risk groups.
The use of SHAP values for model explanation (Fig 3) and subgroup identification via clustering (Fig 4) provides valuable insights into feature contributions and identifies distinct risk profiles, enhancing model interpretability and potential clinical utility.
The simulation of clinical interventions (LDL, HbA1c, SBP lowering) and the analysis of differential response based on genetic risk and identified subgroups (Fig 5) effectively demonstrate the framework's potential for personalized risk reduction strategies.
The external validation using a diverse cohort (All of Us) and the development of a generalizable genetic model demonstrate the framework's robustness and potential applicability beyond the initial UKBB training data, including across different ethnicities.
This low-impact improvement would enhance clarity regarding the intervention simulations. The Results section presents compelling simulations of risk reduction (Fig 5), but the rationale for selecting the specific clinical thresholds (e.g., PCE ≥ 7.5%, HbA1c ≥ 6%, SBP ≥ 140 mmHg) to define the 'at-risk' groups for these simulations isn't explicitly stated here. Briefly mentioning the clinical relevance or guideline basis for these thresholds within this section would provide immediate context for the reader evaluating the intervention analysis. Adding this detail directly supports the interpretation of Figure 5 and the subsequent discussion on differential responses. This enhancement would strengthen the transparency of the simulation methodology presented in the Results.
Implementation: When introducing the analysis shown in Figure 5 (paragraph starting 'Genetics and differential response...'), briefly incorporate the basis for the chosen thresholds. For example: 'Using our trained models, we simulated the influence of risk-reducing interventions... The clinical interventions, clinically relevant thresholds used to identify at-risk individuals (e.g., PCE ≥ 7.5% for statin consideration, HbA1c ≥ 6% indicating prediabetes/diabetes, SBP ≥ 140 mmHg indicating hypertension), perturbed biomarker and their clinical targets are described in Supplementary Table 12.'
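The role the thresholds play in such a simulation can be sketched as below. Everything here is an assumption made for illustration: `toy_risk` is a hypothetical logistic stand-in for the trained model, its coefficients are invented, and the SBP target of 130 mmHg is a placeholder (the actual targets are in the paper's Supplementary Table 12). Only the SBP ≥ 140 mmHg threshold comes from the text above.

```python
import math

def toy_risk(person):
    """Hypothetical stand-in for the trained model: logistic in age and SBP."""
    z = -9.0 + 0.05 * person["age"] + 0.03 * person["sbp"]
    return 1.0 / (1.0 + math.exp(-z))

def simulate_sbp_lowering(cohort, threshold=140, target=130):
    """Risk change when SBP is set to `target` for people at or above `threshold`."""
    results = []
    for person in cohort:
        if person["sbp"] < threshold:
            continue  # not in the at-risk group, so no intervention is simulated
        treated = dict(person, sbp=target)  # copy with the perturbed biomarker
        results.append((person["id"], toy_risk(person) - toy_risk(treated)))
    return results

cohort = [
    {"id": "a", "age": 55, "sbp": 150},
    {"id": "b", "age": 60, "sbp": 128},
    {"id": "c", "age": 48, "sbp": 140},
]
reductions = simulate_sbp_lowering(cohort)
for person_id, delta in reductions:
    print(person_id, round(delta, 4))
```

Because the threshold decides who enters the simulation at all, stating its guideline basis directly affects how the resulting risk reductions should be read.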
This medium-impact improvement would enhance the interpretation of the identified risk subgroups. The Results section successfully identifies five distinct CAD risk subgroups using SHAP value clustering (Fig 4a) and details the features differentiating them (Fig 4b, 4c). However, the clinical interpretation or potential phenotypic meaning of these statistically derived subgroups could be briefly elaborated upon here. Connecting the key differentiating features (e.g., high genetic risk via specific PRSs, high ASCVD meta-feature prediction) to potential underlying clinical profiles would help translate the statistical findings into more tangible concepts for the reader within the Results section itself. This addition would strengthen the narrative by providing initial insights into why these subgroups are different, complementing the detailed characteristics in the supplement. This enhancement directly aids in understanding the significance of the subgroup analysis presented in Figure 4.
Implementation: After describing the features with the largest effect sizes differentiating the subgroups (e.g., end of the paragraph discussing Fig 4c), add a sentence or two providing a brief clinical interpretation. For example: 'These features differentiated the subgroups to a greater extent than age or sex as measured by η² (Fig. 4c), suggesting these subgroups may represent distinct underlying pathophysiological pathways, such as those driven primarily by inherited atherosclerotic predisposition (high CAD PRS and ASCVD meta-feature subgroups) versus those potentially influenced more by other factors despite similar overall predicted risk.'
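For reference, the effect size mentioned in that example is the share of a feature's total variance explained by subgroup membership. A minimal sketch, assuming a one-way layout and using made-up values (the `eta_squared` helper is illustrative, not the authors' code):

```python
def eta_squared(groups):
    """One-way ANOVA effect size: between-group sum of squares / total sum of squares."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    )
    return ss_between / ss_total

# Toy feature values for two subgroups.
subgroup_values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(eta_squared(subgroup_values))
```

Values near 1 indicate that the feature separates the subgroups almost completely; values near 0 indicate it barely differs between them.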
Fig. 1 | Overview of cohort construction, model development and performance assessment for 10-year incident CAD risk meta-prediction in the UKBB.
Fig. 2 | Comparative performance of meta-prediction stratified by standard risk factors in the UKBB population.
Fig. 3 | SHAP summary plot of features in the meta-prediction framework in the UKBB population.
Fig. 4 | Identification of CAD risk subgroups and distinguishing features in the UKBB population.
Fig. 5 | Benefit of clinical interventions in genetic risk and risk subgroups in the UKBB population.
Extended Data Fig. 1 | Feature importance and SHAP summary for 10-year prospective CAD risk prediction in the UK Biobank.
Extended Data Fig. 2 | Evaluating the calibration and predictive value of feature categories for the meta-prediction model in the UK Biobank.
Extended Data Fig. 3 | Comparative performance of CAD risk prediction models in the UK Biobank.
Extended Data Fig. 4 | SHAP summary plots for meta-features in the final model in the UK Biobank.
The Discussion effectively synthesizes the core finding: the meta-prediction framework's superior performance in prospective CAD risk prediction compared to existing clinical and research standards, clearly stating the main contribution.
A key strength of the framework – the comprehensive dissection and integration of various risk factor types (past/present/future, modifiable/unmodifiable, genetic/acquired) – is clearly articulated, highlighting its methodological advantage.
The Discussion emphasizes the crucial role of genetic risk, particularly through meta-features derived from unmodifiable factors, in achieving superior and actionable risk profiles, reinforcing the study's central theme.
The section effectively addresses the limitations of current standard risk scores (PCE, QRISK3), citing specific issues like underestimation in younger individuals and overestimation overall, and positions the meta-prediction model as a potential solution.
The Discussion thoughtfully outlines potential future enhancements, such as incorporating longitudinal data, improving EHR data capture, and expanding population diversity, demonstrating critical self-assessment and forward thinking.
The concluding paragraph effectively summarizes the main contributions – superior prediction (especially in low-risk groups), actionable insights for interventions, and the demonstration of genetic risk's power – providing a strong take-home message.
This medium-impact improvement would enhance the discussion of personalized interventions. The Discussion highlights that genetic risk mediates differential benefits (Fig 5), but it could elaborate on how this might occur mechanistically, even if only speculatively. This belongs in the Discussion as it involves interpreting the findings and their biological/clinical implications. Adding potential explanations (e.g., gene-drug interactions, genetic influence on baseline risk factor levels driving intervention headroom, pathway-specific effects) would strengthen the paper by moving beyond observation towards potential underlying reasons for the differential responses, stimulating further research into personalized prevention mechanisms.
Implementation: Expand the paragraph discussing differential benefits (currently ending with '...mediating the differential benefits of standard interventions.') by adding sentences that speculate on potential mechanisms. For example: 'This mediation might occur through various pathways, such as genetic influences on baseline biomarker levels creating different potential ranges for improvement, pharmacogenetic effects altering response to specific therapies (e.g., statins), or interactions where genetic predisposition heightens sensitivity to specific modifiable risk factors targeted by interventions.'
This low-impact improvement would enhance the discussion around model generalizability. The Discussion mentions the development of a generalizable genetic model excluding UKBB-derived PRSs to mitigate overfitting concerns. However, it could more explicitly compare the implications of this model's strong performance (AUC 0.80) versus the main model (AUC 0.84). This comparison fits well within the Discussion's role of interpreting results and limitations. Adding this comparison would strengthen the argument by highlighting that while the full model leverages potentially cohort-specific PRS information for maximal accuracy, a substantial portion of the predictive power comes from broadly applicable genetic risk, reinforcing the core importance of genetics beyond specific PRS derivations.
Implementation: In the paragraph discussing limitations and the generalizable model, add a sentence explicitly comparing its performance implications to the main model. For example, after stating the generalizable model showed superior accuracy and calibration, add: 'While the main model achieved slightly higher overall accuracy, likely by leveraging UKBB-specific PRS information, the strong performance of the generalizable model underscores that the core predictive power stems significantly from broadly applicable genetic risk architecture, reinforcing the robustness of integrating genetics even independent of potentially cohort-optimized PRSs.'
This low-impact improvement would add nuance to the discussion of clinical implementation. The Discussion mentions that the necessary infrastructure for deploying such models 'has or is being established' at leading medical centers. This statement could be slightly qualified by acknowledging potential real-world barriers beyond infrastructure. This fits the Discussion's role of considering practical implications and limitations. Adding a brief mention of challenges like data integration hurdles, clinician training, cost-effectiveness analyses, or regulatory pathways would strengthen the paper by presenting a more complete picture of the translational pathway, managing expectations about immediate widespread deployment.
Implementation: In the paragraph discussing clinical workflow integration, modify the sentence about infrastructure. For example: 'At leading medical centers, the infrastructure necessary for the deployment of these types of predictive models has been or is being established, although broader implementation will also require addressing challenges related to seamless EHR integration, clinician education, cost-effectiveness, and regulatory considerations.'
Extended Data Fig. 6 | SHAP explanation of streamlined meta-prediction in UK Biobank and All of Us research program.
Extended Data Fig. 7 | Overview of generalizable genetic meta-prediction model in the UK Biobank.
Extended Data Fig. 8 | Feature importance and SHAP summary for 10-year prospective CAD risk prediction of generalizable genetic model in the UK Biobank.
The manuscript clearly defines the two primary UK Biobank cohorts (prevalent and incident) used, including sample sizes, age range, and the rationale for their use in the meta-prediction approach. Detailed exclusion criteria and control assignment methods are provided, enhancing transparency.
The methods provide a comprehensive description of both modifiable (sociodemographic, lifestyle, clinical, lab) and unmodifiable (genetics, PRS, ancestry, demographics) features. Crucially, it details data handling procedures like encoding, imputation (MICE), aggregation, and the generation of synthetic features and conventional risk scores.
The process for generating Polygenic Risk Scores (PRSs) is well-documented, including genotype imputation methods, PRS selection criteria (from GWAS, PGS Catalog), exclusion rules to prevent overfitting (e.g., removing UKBB-derived PRSs), and standardization.
The manuscript clearly outlines the machine learning pipeline, including the types of models considered (tree-based), train-test splitting strategy, feature selection methodology (approximate SHAP, 'zoish' wrapper), hyperparameter tuning approach (Optuna via 'lohrasb'), and model evaluation metrics (F1, R²).
The novel meta-prediction strategy is explicitly detailed, explaining how baseline models trained on the prevalent cohort generate meta-features (predicting baseline/future risk factors and diagnoses) that are subsequently used in the final incident CAD prediction model.
The methods for subgroup identification (SHAP-based hierarchical clustering) and therapeutic prioritization (intervention simulation by modulating risk factors) are clearly described, linking the predictive model to potential clinical applications.
The inclusion of a detailed external validation strategy using the All of Us (AoU) cohort, including feature mapping, handling of missing data, PRS calculation adjustments, and the training of specific streamlined/generalizable models, significantly strengthens the study's claims of robustness.
Adherence to reporting guidelines (MI-CLAIM, TRIPOD) is stated, and details regarding software versions and code availability (via GitHub) are provided, promoting transparency and reproducibility.
This low-impact improvement would enhance methodological justification. The Methods section describes using MICE via miceRanger with 5 iterations for imputation and specifically imputing 12 important features with >20% missingness. While the procedure is clear, briefly stating the rationale for choosing MICE over other methods (if specific reasons exist beyond common practice) and justifying the choice of 5 iterations could slightly strengthen the methodological rigor. Similarly, explicitly stating why those specific 12 high-missingness features (e.g., smoking) were deemed crucial enough to impute despite high missingness would add clarity. This detail belongs in the Methods section as it pertains directly to data preprocessing choices.
Implementation: In the 'Modifiable predictive features' subsection, after describing the imputation process, add a sentence clarifying the choice of method and parameters, if applicable (e.g., 'MICE was chosen for its flexibility in handling different variable types, and 5 iterations were deemed sufficient based on convergence diagnostics in preliminary analyses.') Also, add a brief justification for imputing the 12 high-missingness features (e.g., 'These 12 features, including key smoking variables, were imputed despite >20% missingness due to their established strong association with CAD risk.')
This low-impact improvement would add minor detail for reproducibility. The Methods section states that 307 PRSs were excluded if they had less than 80% overlap of SNPs with the study's genotyping data. While excluding PRSs with poor SNP overlap is standard, explicitly stating the rationale for the 80% threshold (e.g., 'chosen to ensure sufficient genetic information content for reliable score calculation' or 'consistent with common practice in PRS studies') provides slightly more justification. This detail fits within the Methods section's purpose of detailing procedural choices.
Implementation: In the 'Unmodifiable predictive features' subsection, when describing PRS exclusions, add a brief justification for the 80% SNP overlap threshold. For example: '...or had less than 80% overlap of SNPs with our genotyping data were dropped, as this threshold ensures adequate variant coverage for robust score calculation.'
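The exclusion rule itself is simple to state in code, which may help readers see what the threshold controls. The sketch below assumes each PRS is represented as a set of variant identifiers; the function name, the rsIDs, and the PRS names are hypothetical.

```python
def filter_prs_by_overlap(prs_variants, genotyped, min_overlap=0.80):
    """Keep PRSs whose variants are at least `min_overlap` covered by the genotype data."""
    kept = {}
    for name, variants in prs_variants.items():
        coverage = len(variants & genotyped) / len(variants)
        if coverage >= min_overlap:  # scores below the threshold are dropped
            kept[name] = coverage
    return kept

genotyped = {"rs1", "rs2", "rs3", "rs4", "rs5"}
prs_variants = {
    "prs_a": {"rs1", "rs2", "rs3", "rs4", "rs9"},  # 4 of 5 variants present: kept
    "prs_b": {"rs1", "rs7", "rs8"},                # 1 of 3 variants present: dropped
}
kept = filter_prs_by_overlap(prs_variants, genotyped)
print(kept)
```

Exposing `min_overlap` as a parameter also makes a sensitivity check on the 80% choice straightforward, which would support the justification requested above.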
This medium-impact improvement would enhance the reproducibility of the subgroup analysis. The Methods section details the use of agglomerative hierarchical clustering with Ward's linkage and Euclidean distance for subgroup identification, stating the number of subgroups was defined by fixed-height tree cutting. However, the specific height used for cutting the dendrogram, or alternatively, the method used to determine the optimal number of clusters (resulting in five), is not specified here. Including this parameter is crucial for others attempting to replicate the clustering results. This information is essential for the Methods section.
Implementation: In the 'Subgroup identification and therapeutic prioritization' subsection, specify the fixed height used for tree cutting or the method used to determine the number of clusters (five). For example: 'The number of subgroups was defined by cutting the resultant dendrogram at a fixed height of [specify height]...' or 'The optimal number of clusters was determined to be five using [specify method, e.g., silhouette analysis, gap statistic] prior to applying fixed-height tree cutting.'
This low-impact improvement would slightly enhance immediate clarity. The Methods section mentions the use of custom wrapper packages 'zoish' and 'lohrasb' for feature selection and hyperparameter tuning, respectively, and provides a GitHub link. While code availability is excellent for reproducibility, adding a brief descriptive phrase within the text about the primary function or advantage of these custom packages (e.g., 'our ‘zoish’ wrapper package, which facilitates efficient SHAP calculation using...') would improve the section's standalone readability for those not immediately consulting the code. This fits within the Methods' goal of clearly describing the tools used.
Implementation: In the 'Feature selection and development of the ML pipelines' subsection, add a brief descriptive clause when introducing 'zoish' and 'lohrasb'. For example: 'We then used our ‘zoish’ wrapper package, which incorporates the v2 algorithm of the SHAP Tree Explainer from the fasttreeshap package to enable rapid feature importance assessment...' and 'Hyperparameter tuning was automated using our lohrasb module, which integrates Optuna’s TPE sampler and Hyperband pruner for efficient search space exploration...'
Extended Data Fig. 9 | Feature distribution pre- and post-imputation in the UK Biobank.
Extended Data Table 1 | Baseline characteristics of the UK Biobank participants in the study (n=339,667)